Method for constructing a histogram

ABSTRACT

A method for constructing a histogram can include sampling attributes in a column of a database on a server and determining a bucket set for the histogram based on a number of buckets that represents a distribution of the attributes with minimum error. A bucket in the bucket set includes boundaries and an approximation of a count of attributes falling within the boundaries. The method further includes determining a precision for encoding the approximation, such that the histogram having the bucket set fits within a storage limit on a tangible computer-readable medium. The histogram can then be stored for the database on a tangible computer-readable medium by encoding the approximation with the precision.

BACKGROUND

Histograms are used in databases as lossily compressed representationsof the statistics of the data resident in a table. The statisticsobtained from the histograms are used for query optimization and in somecases, approximate query processing.

Database modules directed to query optimization tasks often utilizeestimates of query result sizes. For example, query optimizers selectthe most efficient access plan for a query based on estimated costs.These costs can be in turn based on estimates of intermediate resultsizes. Sophisticated user interfaces also use approximations of resultsizes as feedback to a user before a query is actually executed. Suchfeedback helps to detect errors in queries or misconceptions about thedatabase. However, these statistics merely approximate the distributionof data values in attributes of the relations, and often are based onassumptions, such as a uniform distribution of attribute values, thatoften do not hold. Therefore, the statistics can represent an inaccuratepicture of the actual contents of the database.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram illustrating a method according to an exampleof the present technology;

FIG. 2 is a flow diagram illustrating computing steps in a particularexample of the method;

FIG. 3 is a flow diagram illustrating computing steps in anotherparticular example of the method; and

FIG. 4 is a schematic diagram of a system according to an example of thepresent technology.

DETAILED DESCRIPTION

Reference will now be made to the examples illustrated, and specificlanguage will be used herein to describe the same. Features andadvantages of the technology will be apparent from the detaileddescription which follows, taken in conjunction with the accompanyingdrawings, which together illustrate, by way of example, features of thetechnology.

Examples are presented below for constructing and storingrepresentations of statistical aspects of data in a database or otherrelational construct. A particular example is directed to constructionof a histogram of an attribute in a database column given a storageconstraint on the histogram. As used herein, the term “column” refers toa structure of a database, where said database can be considered to beorganized as a table with rows and columns. In a particular example, adatabase comprises records organized in rows and including categories ofdata, where the data in a category constitutes a column. However, itshould be understood that the discussion herein is not limited to aparticular relational database system.

For purposes of the subject technology, “attributes” can include integerdata as well as non-numerical data (e.g. text) that has been convertedinto an integer code and stored in the column. Histograms typicallypartition the set of attribute values into “buckets” (i.e. histogramcomponents) comprising one or more boundaries that define a range ofpotential attribute values. The number of occurrences in the column ofeach of the attribute values (i.e. value frequency) falling within theboundaries of a bucket can be represented by a single rational number,referred to herein as a “count” or “bucket count”.

A histogram can be used to estimate the likely result size for aparticular query of the database. For example, to estimate the resultsize of a predicate having a range, an estimation routine can beemployed to identify each bucket having boundaries that intersect withthat range. Then, using specified approximation formulas, the routineestimates the number of values in each identified bucket that satisfythe range predicate, along with the frequency of each such value. Thesefrequencies are summed over the identified buckets to yield the estimateof the result size.

One challenge presented by histogram construction for database queryoptimization is the construction of a representation of the data withina given amount of storage space so that relevant statistics can begleaned with minimal error. While considerable effort has been spent inoptimizing the boundaries of the buckets, much less attention has beenpaid to optimizing the number of bits that are spent in representing thebucket count in storage space. The latter consideration can be equallycrucial, in that the ultimate constraint on the histogram typically isavailable storage space. For example, in a large database (i.e. having alarge number of attributes, n) obtaining a representation of the datathat both exhibits low error and can be stored in a limited space canpresent a challenge.

As shown in FIG. 1, one example of a method 100 for constructing ahistogram can include a step 102 of sampling attributes in a column of adatabase on a server, and then a step 104 of determining a bucket setfor the histogram based on a number of buckets that represents adistribution of the attributes with minimum error. The buckets in thebucket set include boundaries and an also an approximation of a count ofattributes falling within the boundaries. The method can further includethe step 106 of determining a precision for encoding the approximation,such that the histogram having the bucket set fits within a storagelimit, e.g. for storage on a tangible computer-readable medium. Thehistogram can then be stored 108 for the database on a tangiblecomputer-readable medium by encoding the approximation with theprecision. Techniques are further outlined below to trade-off betweenthe number of bits used for representing each bucket count and the totalnumber of buckets used, so as to minimize the error in the histogramrepresentation while adhering to the storage constraint.

Once constructed, a histogram can be stored for use in query-relatedoperations. For example, a database system catalog can include one ormore histograms constructed from tables in the database. The histogramcan be reconstructed or otherwise updated to reflect changes indatabase. Updating can be done according to a schedule appropriate tothe database. Storage of the histogram can involve storing particulardescriptive values associated with the histogram and the underlyingstatistics. For example, a histogram can be represented in storage byvalues for at least one boundary for each bucket and a representation ofthe attribute counts in each bucket. All buckets and bucket counts canbe included, or alternatively only non-zero counts and their associatedbuckets may be included. According to an example, the representation ofthe bucket count can be a single-value scalar representation, such as amean (e.g. arithmetic mean, geometric mean), a median or a mode. In aparticular example, the frequencies in a bucket are represented instorage by an average of the frequencies. Other bucket properties can bestored using sufficient information to characterize the histogram whileminimizing size. For example, boundaries of buckets can be representedusing either the upper boundary or lower boundary of each bucket.

In one example, a technique can comprise constructing a histogram thatrepresents the bucket counts up to a certain fixed precision. Limitingcount precision can introduce further error in the representation of thecounts of the values. However, the storage savings obtained in thenumber of bits needed to represent the bucket counts can be used toincrease the number of buckets and thereby reduce the overall error.

An attribute occurs in a column of a database. A method for constructinga histogram can include sampling attributes in the column of thedatabase on a server and determining a bucket set for the histogrambased on a number of buckets that represents a distribution of theattributes with minimum error. Let U={1, 2, . . . k}, denote the orderedset of values that an attribute in question can take. This abstractionallows for instances where the attribute values are not just integers,but arise from any set that permits a total ordering. Such anabstraction is particularly useful when the queries in question arerange queries. Let n denote the number of rows in the column. Let vector

=(c₁, c₂, . . . c_(k))ε

^(k) such that

${c_{i} \geq {\sum\limits_{i = 1}^{k}c_{i}}} = n$denote the number of times each value appears in the column (the integerc_(i) is the number of times i appears in the column). In one example, ahistogram is an approximate representation of c in which the set U ispartitioned into a collection of non-overlapping intervals, i.e.buckets. Let the bucket boundaries for the B buckets be 1≦b₁<b₂< . . .b_(B−1)<b_(B)=k. For 1≦j≦B, the jth bucket, denoted by

_(j), is the set of values defined by the boundaries, i.e. {b_(j−1)+1,b_(j−1)+2, . . . b_(j)}. Once the bucket boundaries are established, thecounts of the attribute values in each bucket are approximated by asingle rational number. For example, for all iε

_(j), the counts c_(i) can be approximated by a single rational numberĉ_(j). Accordingly, the approximate histogram representation

is a vector of length k obtained by replacing c_(i) (where 1≦i≦k and isa member of

_(j)) in

by ĉ_(j). In a particular example, ĉ_(j) is set to be the arithmeticmean of the counts of all the values in bucket j. However, other choicessuch as the median or mode may be used. The quantity ĉ_(j) can then bestored as a representative approximation of the bucket count in bucketj.

The error in the histogram representation can be measured in severalways. For example, a squared error metric can be used as follows. Forany two length-k vectors x=(x₁, x₂, . . . , x_(k)), y=(y₁, y₂, . . . ,y_(k)), the error in representing x with y is given by

$\begin{matrix}{{d( {x,y} )} = {\sum\limits_{i = 1}^{k}( {x_{i} - y_{i}} )^{2}}} & \lbrack {{Eq}.\mspace{14mu} 1} \rbrack\end{matrix}$

This is the average squared error incurred in the case of a uniformdistribution on all possible point queries. Other choices of errormetrics such as weighted squared error, absolute error and worst-caseerror can be accommodated in the following exposition. For purposes ofthe following discussion, the present technology is described for thecase of squared error. In one aspect, an outcome of the method comprisesconstructing

while minimizing the squared error loss d(

,

) while complying with a given storage constraint. In another aspect, anoutcome is minimizing the storage space needed to store

while complying with a constraint on the error metric. The space used tostore

depends on the precise coding scheme employed. As noted above, oneapproach for storing

is to store the bucket boundaries b₁, b₂, . . . , b_(B−1) and the valuesof the counts for each bucket, namely {

}_(j=1) ^(B). In a particular example, it can suffice to store oneboundary for each bucket, e.g. the upper boundary or alternatively alower boundary. Accordingly, the histogram representation can includeB−1 boundaries, and to represent the bucket boundaries calls for at most

$\begin{matrix}{{\log\begin{pmatrix}{k + B - 1} \\{B - 1}\end{pmatrix}} \leq {( {B - 1} )\log\; k}} & \lbrack {{Eq}.\mspace{14mu} 2} \rbrack\end{matrix}$bits (note that the notation “log” as used herein denotes a base 2logarithm). The left hand side is the fewest number of bits necessaryfor a lossless representation that uses a fixed-length coding scheme. Afixed-length coding scheme uses the same number of bits to represent allpossible combinations of bucket numbers and boundaries. The right handside is achieved by representing each of the B−1 bucket boundaries usinglog k bits. The discussion henceforth assumes use of this coding scheme.In accordance with this scheme, given a total storage budget of R bits,the number of bits available for storing the approximate counts{ĉ_(j)}_(j=1) ^(B) is R−(B−1) log k bits. A common approach is torepresent the approximate counts exactly and choose the bucket quantityB so as to meet the storage constraint of R bits. However, in accordancewith the examples described herein one can choose encoding schemes forthe approximate counts while optimizing the bucket quantity B and thebucket boundaries so as to minimize d(

,

). For example, the method can include determining a precision forencoding the approximation of counts, such that the histogram having thebucket set fits within a storage limit.

In one example, each bucket count ĉ_(j) can be represented up to a givenprecision (denoted Δ with respect to this example). In a particularaspect, let ĉ_(j)=m_(j)Δ, where m_(j) is a non-negative integer. Letr=┌n/Δ┐+1 denote the maximum number of values that m_(j) can take. Then,where {ĉ_(j)}_(j=1) ^(B) is represented by describing each m_(j)separately, the number of bits needed is B log r. From Eq. 2, the totalnumber of bits needed to represent

is (B−1) log k+B log r. With only R bits available, the space constrainton storage of

is(B−1)log k+B log r≦R.  [Eq. 3]

In an aspect of the example, d(

,

) is minimized subject to the constraint of Eq. 3. Where ĉ is ahistogram having a bucket set {

_(j)} comprising a quantity B of buckets

$\begin{matrix}{{d( {c,\hat{c}} )} = {\sum\limits_{j = 1}^{B}{\sum\limits_{i \in \beta_{j}}( {c_{i} - {\hat{c}}_{j}} )^{2}}}} & \lbrack {{Eq}.\mspace{14mu} 4} \rbrack\end{matrix}$where ĉ_(j)=m_(j)Δ for some integer 0≦m_(j)≦r. The precision used torepresent each count is a source of error in the histogramrepresentation, and is accordingly accounted for in the present method.Accordingly, minimizing d(

,

) involves a constrained minimization problem that can be expressed as

$\begin{matrix}{\min_{B,\Delta,{{\{\beta_{j}\}};{{{{({B - 1})}\log\; k} + {B\;\log\; r}} \leq R}}}{\sum\limits_{j = 1}^{B}{\sum\limits_{i \in \beta_{j}}{( {c_{i} - {\hat{c}}_{j}} )^{2}.}}}} & \lbrack {{Eq}.\mspace{14mu} 5} \rbrack\end{matrix}$It is noted that the minimization space as described includes thestandard histogram representation where Δ=2⁻⁶⁴ and log r is effectively64 bits when floating point precision is used.

In a specific example, the approximation of counts in a bucket can bebased on an arithmetic mean of the counts. For example, for all 1≦j≦B,let

$\begin{matrix}{{\overset{\_}{c}}_{j}\overset{def}{=}{\frac{1}{\beta_{j}}{\sum\limits_{i \in B_{j}}^{\;}\; c_{i}}}} & \lbrack {{Eq}.\mspace{14mu} 6} \rbrack\end{matrix}$denote the mean of the counts in bucket j. Further, let m_(j) equal theclosest integer to

$\frac{{\overset{\_}{c}}_{J}}{\Delta}.$Then for any x

$\begin{matrix}\begin{matrix}{{\sum\limits_{i \in \beta_{j}}^{\;}( {c_{i} - x} )^{2}} = {\sum\limits_{i \in \beta_{j}}^{\;}( {c_{i} - {\overset{\_}{c}}_{j} + {\overset{\_}{c}}_{j} - x} )^{2}}} \\{= {{\sum\limits_{i \in \beta_{j}}^{\;}( {c_{i} - {\overset{\_}{c}}_{j}} )^{2}} + {\sum\limits_{i \in \beta_{j}}^{\;}( {{\overset{\_}{c}}_{j} - x} )^{2}}}} \\{= {{\sum\limits_{i \in \beta_{j}}^{\;}( {c_{i} - {\overset{\_}{c}}_{j}} )^{2}} + {{\beta_{j}}( {{\overset{\_}{c}}_{j} - x} )^{2}}}}\end{matrix} & \lbrack {{Eq}.\mspace{14mu} 7} \rbrack\end{matrix}$where the second equality follows from the definition of c _(j).Therefore Eq. 5 can be reduced to

$\begin{matrix}{\min_{B,\Delta,{{\{\beta_{j}\}};{{{{({B - 1})}\log\; k} + {B\;\log\; r}} \leq R}}}{\sum\limits_{j = 1}^{B}{( {{\sum\limits_{i \in \beta_{j}}( {c_{i} - {\overset{\_}{c}}_{j}} )^{2}} + {{\beta_{j}}( {{\overset{\_}{c}}_{j} - {\overset{\_}{c}}_{j}} )^{2}}} ).}}} & \lbrack {{Eq}.\mspace{14mu} 8} \rbrack\end{matrix}$Instead of optimizing over both B and Δ simultaneously, one can make asimplifying assumption that the optimal Δ is likely to be small relativeto n, and that the fractional part of c _(j)/Δ is approximatelyuniformly distributed between 0 and 1. This assumption allows Δ to befixed as a function of B, so that there is a particular (delta)associated with each possible B. This can reduce the optimization spaceby allowing each B to be evaluated with respect to the associatedprecision and to the storage constraint.

For example, for a given B, let Δ*(B), r*(B) denote the minimum value ofΔ and the corresponding maximum value of r that satisfy constraint ofEq. 3. The following simpler minimization problem is presented:

$\begin{matrix}{\min_{B,{\{\beta_{j}\}}_{j = 1}^{B}}{\sum\limits_{j = 1}^{B}{( {{\sum\limits_{i \in \beta_{j}}^{\;}( {c_{i} - {\overset{\_}{c}}_{j}} )^{2}} + {{\beta_{j}}( {{\overset{\_}{c}}_{j} - {m_{j}{\Delta^{*}(B)}}} )^{2}}} ).}}} & \lbrack {{Eq}.\mspace{14mu} 9} \rbrack\end{matrix}$

For a given number of buckets B, the part of Eq. 9 that is

$\min\limits_{B,{\{\beta_{j}\}}_{j = 1}^{B}}{\sum\limits_{j = 1}^{B}( {\sum\limits_{i \in \beta_{j}}^{\;}( {c_{i} - {\overset{\_}{c}}_{j}} )^{2}} )}$has been previously addressed using dynamic programming methods.However, such an approach does not accommodate the additional errorarising as a result of representing c _(j) to a limited precision. Thepresent technology employs a modified dynamic programming method thataccommodates this error. The modified method can be described asfollows.

For any interval bounded by a and b, where 1≦a≦b≦k, let a Modified SumSquared Error (MSSE) be defined thusly:

$\begin{matrix}{{{MSSE}( \lbrack {a,b} \rbrack )}\overset{def}{=}{{\sum\limits_{i = a}^{b}( {c_{i} - {{AVG}( \lbrack {a,b} \rbrack )}} )^{2}} + {( {b - a + 1} )^{2}( {{{{AVG}( {\lbrack {a,b} \rbrack - {\lbrack \frac{{AVG}( \lbrack {a,b} \rbrack }{\Delta} \rbrack\Delta}} )}^{2}\mspace{79mu}{where}\mspace{79mu}{{AVG}( \lbrack {a,b} \rbrack )}}\overset{def}{=}{\frac{1}{b - a + 1}{\sum\limits_{i = a}^{b}\; c_{i}}}} }}} & \lbrack {{Eq}.\mspace{14mu} 10} \rbrack\end{matrix}$is the average of {c_(i)}_(i=a) ^(b). One could pre-compute

$\sum\limits_{i = 1}^{p}{c_{i}\mspace{14mu}{and}\mspace{14mu}{\sum\limits_{i = 1}^{p}c_{i}^{2}}}$for all values of p. This costs O(k) time. This upfront computationalcost allows one to compute MSSE([a,b]) and AVG([a,b]) for any a,b inconstant time. Let MSSE*(i,p) denote the minimum sum squared error inthe histogram representation of (c₁, c₂, . . . , c_(i)) with p bucketsand Δ*(B) as the precision (B is implicit in the notation). In seekingMSSE*(k,B) a principal observation is thatMSSE*(i,p)=min_(1≦i) ₁ _(≦i){MSSE*(i,p−1)+MSSE([i ₁+1,i])}.  [Eq. 11]To calculate MSSE*(k,B), Equation 11 above can be used to computeMSSE*(i,p) for all 1≦i≦k and all 1≦p≦B, in increasing order of p and foreach p in increasing order i. In one aspect, this can includemaintaining an error table to store the results of the computation. In aspecific example, a table of size kB is maintained, where O(k)computations are performed to compute each entry. Then Equation 9 can besolved by repeating the procedure for 1≦B≦n with the appropriate Δ ineach case and then computing min_(B) MSSE*(k, B). An example method 200to execute the above approach is summarized in FIG. 2.

In another example, rather than a single bucket count, a technique canapproximate the count vector in a bucket by using another vector, onethat belongs to a pre-determined finite set of vectors. This approach isakin to vector quantization, whereas the above technique is analogous toscalar quantization. In a particular example, the vectors used can comefrom a pre-determined finite set of dimension |

_(j)|. The pre-determined finite set of vectors can in particular bederived from lattices. In one aspect, the set of vectors is a subset ofpoints in a lattice. In such cases, lattice encoding and decodingmethods may be employed to identify the vector closest to the countvector being quantized.

According to this example, a lattice in m-dimensions can be defined by anon-singular m×m matrix M so that if v is an m-dimensional vector ofintegers, the lattice Λ is the set of all vectors of the form Mv.However for purposes of practicing the present method, thenon-singularity of Mmay be relaxed. In a particular aspect, the latticerepresents a set of regularly spaced points in m dimensions. It shouldbe noted that any scaling of the lattice (e.g. according to a scalingparameter a) results in another lattice where the points are similarlyregularly spaced. The spacing in the scaled lattice is larger or smallercompared to the original lattice depending on whether α is greater orlesser than 1. Lattices are used in vector quantization when a lossyrepresentation of a real-valued vector is sought.

Given a partition of the column {1, 2, . . . , k} into B buckets {

_(j)}, the vector of counts c can be approximated using lattices asfollows. Let {Λ_(i)}_(i=1) ^(k) denote a pre-determined set of latticeswhere Λ_(i) is a lattice of dimension i. It should be noted that for allscaling parameters α, αΛ_(i) is also a lattice. Given n and a scalingparameter α, the subset of the lattice αΛ_(i) that is a validrepresentation of a count vector of length i, denoted {tilde over(Λ)}_(i), is defined{tilde over (Λ)}_(i) ={v=(v ₁ ,v ₂ , . . . ,v _(i))εαΛ_(i) :v _(i)≦0 andΣ_(l) v _(l) ≦n}.  [Eq. 12]A greater α will result in fewer points being in {tilde over (Λ)}_(i)and therefore fewer bits can be used to represent a given point. Thus atrade-off can be made between the number of bits assigned to representthe counts in a bucket and the number of buckets by adjusting α andselecting B to obey a storage constraint. Alternatively a similartrade-off can be made by selecting α with regard to the storageconstraint and adjusting B to also meet the constraint. Thus, α canserve an analogous role as a precision factor in this example to that ofΔ described above.

In accordance with the example, let b₁, b₂, . . . b_(B−1) denote thebucket boundaries and let L₁, L₂, . . . , L_(B) denote the sizes of thebuckets used to partition {1, 2, . . . , k}. For 1≦j≦B, let

_(j)

(c_(b) _(j−1) ₊₁, c_(b) _(j−1) ₊₂, . . . , c_(b) _(j) ) denote thevector of counts corresponding to bucket j. Then for a given α, theapproximate representation of vector

is given by

=(

₁,

₂, . . . ,

_(B)) where

j = arg ⁢ ⁢ min v ∈ Λ ~ L j ⁢  v - j  2 2(with ∥·∥₂ denoting a squared-error norm).

Once the bucket boundaries have been specified, the number of bitsneeded to represent the approximation is log r(α, {L_(j)}) wherer(α,{L _(j)})

Π_(j=1) ^(B)|{tilde over (Λ)}_(L) _(j) |  [Eq. 13]with |{tilde over (Λ)}_(L) _(j) | denoting the cardinality of latticesubsets that approximate the count vector having a length correspondingto the size of bucket j.

As discussed above, for a histogram having B buckets has a storageconstraint such as shown in Eq. 2. Instead of optimizing B and αsimultaneously, α can be set at α*(B), which denotes the minimum valuethat satisfies the storage constraint for each B. In a particularexample,

$\begin{matrix}{{\overset{\_}{r}( {\alpha,B} )}\overset{def}{=}{\max_{{{\{ L_{j}\}}_{j = 1}^{B}\text{:}\mspace{14mu}\sum\limits_{L_{j} = k}^{\;}}\mspace{11mu}}{r( {\alpha,\{ L_{j} \}} )}}} & \lbrack {{Eq}.\mspace{14mu} 14} \rbrack \\{and} & \; \\{{\alpha^{*}(B)} = {\min{\{ {{{\alpha\text{:}\mspace{14mu}( {B - 1} )\log\; k} + {B\;\log\;{\overset{\_}{r}( {\alpha,B} )}}} \leq R} \}.}}} & \lbrack {{Eq}.\mspace{14mu} 15} \rbrack\end{matrix}$This value of α allows all possible partitions of {1, 2, . . . , k} intoB buckets. As such, when {

_(j)} is specified this value may be somewhat conservative. However thisapproach allows for separation of the problem of optimizing the scalingfactor from that of identifying the bucket boundaries. It should also benoted that, given n and k, all values α*(B) can be computed in advancefor all values of B. In a further aspect, the values of α*(B) can bestored in a precision table. In that case this step will not contributeto the computation time for generating the histogram.

The error in representing count vector c in this manner is

$\begin{matrix}{{{d( {c,\hat{c}} )} = {\sum\limits_{j = 1}^{B}{{c_{j} - {\hat{c}}_{j}}}_{2}^{2}}},} & \lbrack {{Eq}.\mspace{14mu} 16} \rbrack\end{matrix}$where for 1≦j≦B,

_(j) is the sub-vector of counts corresponding to the jth bucket and

_(j) is the vector used to approximate that sub-vector. In the presentmethod, a histogram is constructed in which this error is minimized,i.e.

$\begin{matrix}{\min_{B}{\min_{{\{\beta_{j}\}}_{j = 1}^{B}}{{d( {c,\hat{c}} )}.}}} & \lbrack {{Eq}.\mspace{14mu} 17} \rbrack\end{matrix}$In one aspect, implicit in the inner minimization is that the acorresponding to the lattices used is α*(B). Another aspect is that theα is independent of the bucket boundaries.

The inner minimization can be performed as follows. For any intervalbounded by a and b, where 1≦a≦b≦k, let

(a,b) denote the sub-vector (c_(a), c_(a+1), . . . c_(b)). Let a ClosestLattice Vector (CLV) be defined asCLV([a,b])

arg min_(vε{tilde over (Λ)}) _(b−a+1) ∥

−

(a,b)∥₂ ²,  [Eq. 18]i.e. the vector in the lattice {tilde over (Λ)}_(b−a+1) that is closestto c(a,b) in terms of the squared-error norm. Further let a Sum SquaredLattice Error(SSLE) be defined asSSLE([a,b])

min_(vε{tilde over (Λ)}) _(b−a+1) ∥

−

(a,b)∥₂ ²,  [Eq. 19]i.e. the corresponding minimum squared-error. In contrast to MSSE([a,b])and AVG([a,b]), here O(k³) is used to compute CLV([a,b]) and SSLE([a,b])for all a,b. These can be found using known methods for finding theclosest lattice point. Let MSSE*(i,p) denote the minimum sum squarederror in a histogram representation of (C₁, c₂, . . . c_(i)) with pbuckets and α*(B) as the scaling factor. In seeking MSSE*(k,B) it isobserved thatMSSE*(i,p)=min_(1≦i) ₁ _(≦i){MSSE*(i,p−1)+SSLE*([i ₁+1,i])}.  [Eq. 20]To calculate MSSE*(k,B), Eq. 20 can be used to compute MSSE*(i,p) forall 1≦i≦k and all 1≦p≦B, in increasing order of p and for each p inincreasing order i. In one aspect, this can include maintaining a tableto store the results of the computation. In a specific example, a tableof size kB is maintained, where O(k) computations are performed tocompute each entry. Then Equation 17 can be solved by repeating theprocedure for 1≦B≦n with the appropriate α*(B) in each case and thencomputing min_(B) MSSE*(k, B). An example method 300 to execute theabove approach is summarized in FIG. 3.

Summarizing and reiterating to some extent, technologies describedherein provide a way to represent the distribution of an attribute in adatabase, while both minimizing representational error and meeting astorage constraint. It is contemplated that such methods can be used inthe processing of data in a database, particularly in enhancing queryprocessing. It should be noted that all or portions of the subjecttechnology can be implemented as a system, method, apparatus, or articleof manufacture using standard programming and/or engineering techniquesto produce software, firmware, hardware or any combination thereof tocontrol a computer to implement the disclosed innovation.

The technology described here can be stored on a computer readablestorage medium that includes volatile and non-volatile, removable andnon-removable media implemented with any technology for the storage ofinformation such as computer readable instructions, data structures,program modules, or other data. In one example, a computer-readablestorage medium can have tangibly stored thereon computer-readableinstructions for execution of method for histogram construction asdescribed herein. Computer readable storage media include, but is notlimited to, RAM, ROM, EEPROM, flash memory or other memory technology,CD-ROM, digital versatile disks (DVD) or other optical storage, magneticcassettes, magnetic tapes, magnetic disk storage or other magneticstorage devices, or any other computer storage medium which can be usedto store the desired information and described technology.

Systems utilizing the methods described herein may also includecommunication connections or networking apparatus and networkingconnections that allow devices in the system to communicate with otherdevices. Communication connections are an example of communicationmedia. Communication media typically embodies computer readableinstructions, data structures, program modules and other data in amodulated data signal such as a carrier wave or other transportmechanism and includes any information delivery media. The “modulateddata signal” means a signal that has one or more of its characteristicsset or changed in such a manner as to encode information in the signal.By way of example, and not limitation, communication media includeswired media such as a wired network or direct-wired connection, andwireless media such as acoustic, radio frequency, infrared, and otherwireless media. The term computer readable media as used herein includescommunication media.

In particular, the instructions can be provided for execution by aprocessor in a system, such as a query processor. For example, a systemfor processing data in a database can include a database in which aplurality of attributes is arranged in one or more columns. The systemcan further comprise one or more processors, particularly a queryprocessor for querying the column according to some predicate.Computer-readable instructions can be provided for execution by thequery processor to perform a mapping of the distribution of attributesin accordance with the methods described herein. In a particular aspect,the instructions are provided on a computer-readable medium.

In one example of a system for processing data in a database, describedin FIG. 4, a system 400 can comprise a server 402 as well as a hardwareprocessor device 404, a hardware memory device 406, a localcommunication bus 408 to enable communication between hardware devicesand components, and a networking device 410 for communication across anetwork. The server can include a database 412 containing structureddata 414, and a query processor 416. The query processor can access thedatabase in response to a query input by a user and execute instructionsto create one or more histograms from which an approximate response tothe query can be provided. Once created, the count approximations areencoded using a precision according to the method and stored with thebucket set for the database, e.g. on a computer-readable storage medium.The approximations can be stored on the medium with the database, oralternatively on a storage medium external to the database.

While the forgoing examples are illustrative of the principles of thepresent technology in one or more particular applications, it will beapparent to those of ordinary skill in the art that numerousmodifications in form, usage and details of implementation can be madewithout the exercise of inventive faculty, and without departing fromthe principles and concepts of the innovation. Accordingly, it is notintended that the innovation be limited, except as by the claims setforth below.

The invention claimed is:
 1. A method for constructing a histogram,comprising: sampling attributes in a column of a database on a server;determining a bucket set comprising a number of buckets that representsa distribution of the attributes with minimum error, wherein at leastone bucket includes boundaries and an approximation of a count ofattributes falling within the boundaries; determining a precision forencoding the approximation, wherein a histogram having the bucket setfits within a storage limit and wherein the precision includes a bitthreshold for the approximation of the count of the attributes; andstoring the histogram for the database by encoding the approximationwith the precision for encoding the approximation.
 2. The method ofclaim 1, wherein determining the precision for encoding theapproximation comprises: computing a precision value and an error metricfor a bucket set; storing the precision value and the error metric in atable; and retrieving the precision for encoding the approximation foruse in storing the histogram.
 3. The method of claim 2, wherein theerror metric is a sum squared error.
 4. The method of claim 1, whereinthe approximation is a scalar representation of the count of attributes.5. The method of claim 4, wherein the precision for encoding theapproximation is such that${( {B - 1} )\log_{2}\; k} + {B\;\log_{2}\lceil ( {\frac{n}{\Delta} + 1} ) \rceil}$is no greater than the storage limit, where Δ is a precision value, B isthe number of buckets, (and where n attributes can take on any of kvalues) k is a number of possible values the attributes, n is the numberof attributes.
 6. The method of claim 1, wherein the approximation is avector representation of the count of attributes.
 7. The method of claim6, wherein the approximation is a vector selected from a pre-defined setof vectors.
 8. The method of claim 6, wherein the approximation is avector from a lattice {tilde over (Λ)}_(i) that approximates a countvector with minimum error.
 9. The method of claim 8, wherein theprecision for encoding the approximation ismin{α:(B−1)log k+B log r (α,B)≦R}, where r(α,B) is the number of pointsin the lattice that can approximate the count vector, α is a scalingfactor for lattice {tilde over (Λ)}_(i), B is the number of buckets, kis a number of possible values for the attributes, and R is the storagelimit.
 10. The method of claim 1, wherein the attributes are created byconverting text data into integers.
 11. A non-transitorycomputer-readable storage medium having tangibly stored thereoncomputer-readable instructions for execution by a processor to perform amethod of mapping a distribution of attributes in a database,comprising: determining a bucket set comprising a number of buckets thatrepresents the distribution with minimum error, wherein each bucketcontains an approximation of a count of attributes; determining aprecision for which a histogram comprising the bucket set fits within astorage limit, wherein the precision includes a bit threshold for theapproximation of the count of attributes; and storing the histogram byencoding the approximation with the precision.
 12. The non-transitorycomputer-readable storage medium of claim 11, wherein determining theprecision comprises computing a precision value and an error metric fora bucket set and storing the precision value and the error metric in atable; retrieving the precision for storing the histogram.
 13. Thenon-transitory computer-readable storage medium of claim 12, wherein theerror metric is a sum squared error.
 14. The non-transitorycomputer-readable storage medium of claim 11, wherein the approximationis a scalar representation of the count of attributes.
 15. Thenon-transitory computer-readable storage medium of claim 14, wherein theprecision is such that${( {B - 1} )\log_{2}\; k} + {B\;\log_{2}\lceil ( {\frac{n}{\Delta} + 1} ) \rceil}$is no greater than the storage limit, where Δ is a precision value, B isthe number of buckets, (and where n attributes can take on any of kvalues) k is a number of possible values the attributes, and n is thenumber of attributes.
 16. The non-transitory computer-readable storagemedium of claim 11, wherein the approximation is a vector representationof the count of attributes.
 17. The non-transitory computer-readablestorage medium of claim 16, wherein the approximation is a vectorselected from a pre-defined set of vectors.
 18. The non-transitorycomputer-readable storage medium of claim 17, wherein the approximationis a vector from a lattice {tilde over (Λ)}_(i) that approximates acount vector with minimum error.
 19. The non-transitorycomputer-readable storage medium of claim 18, wherein the precisionmeetsmin{α:(B−1)log k+B log r (α,B)≦R} where r(α,B) is the number of pointsin the lattice that can approximate the count vector, α is a scalingfactor for lattice {tilde over (Λ)}_(i), B is the number of buckets, kis a number of possible values for the attributes, and R is the storagelimit.
 20. A system for processing data in a database, comprising: adatabase that includes plurality of attributes arranged in a column; aquery processor to query the column; a computer-readable storage mediumhaving stored thereon computer-readable instructions for execution bythe query processor to perform a method of mapping a distribution ofattributes in a database, said method comprising: determining a bucketset comprising a number of buckets that represents the distribution withminimum error, wherein each bucket contains an approximation of a countof attributes; determining a precision for which a histogram comprisingthe bucket set fits within a storage limit, wherein the precisionincludes a bit threshold for the approximation of the count ofattributes; and storing the histogram by encoding the countapproximation with the precision.